Fix for the slow tokenizer bug adding spaces to single id decodes #32564
itazap merged 25 commits into huggingface:main
cc @itazap as well!
itazap left a comment:
Thanks for the quick update 🤗 Thanks for merging the tests! Left a few comments about the single special token case, let me know what you think!
No worries, I'll make the changes 😉
@ArthurZucker and @LysandreJik merge time please 😉
Co-authored-by: Ita Zaporozhets <31893021+itazap@users.noreply.github.com>
Gentle ping @itazap, can we do the merge? Some commits from main were failing on this branch, but it looks like everything is fixed now. Can we merge before any more breaking changes come in? 😁 😁 😬
@DuyguA Sorry for the delay! Merged!! 🚀 Thanks for working on this 🤗
🎉 Congrats @DuyguA! This issue has really been a long journey.
…ggingface#32564)

* _decode signature change and quick return
* added bunch of decoding tests
* signature match and return
* added tests for decoding
* merged decoding test
* more tests for special tokens
* cosmetics
* fixed param
* ruffed the file
* refinement for single special tokens
* added test for single special tokens
* slight change to test name
  Co-authored-by: Ita Zaporozhets <31893021+itazap@users.noreply.github.com>
* minor change test name for skip tokens
  Co-authored-by: Ita Zaporozhets <31893021+itazap@users.noreply.github.com>
* killed already defined var
  Co-authored-by: Ita Zaporozhets <31893021+itazap@users.noreply.github.com>
* minor update with vars
  Co-authored-by: Ita Zaporozhets <31893021+itazap@users.noreply.github.com>
* killed already defined var once more
  Co-authored-by: Ita Zaporozhets <31893021+itazap@users.noreply.github.com>

---------

Co-authored-by: Ita Zaporozhets <31893021+itazap@users.noreply.github.com>
What does this PR do?
Quick fix for a tokenizer bug: slow tokenizers insert spaces into the decoded output when the input is a single id.
Fixes #29489
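
To illustrate the kind of bug described above, here is a minimal, self-contained sketch. It does not use the real transformers code: `VOCAB`, `convert_ids_to_tokens`, `decode_buggy`, and `decode_fixed` are hypothetical toy stand-ins. The sketch assumes the stray spaces come from a decode routine that joins tokens with spaces after an id-to-token lookup that returns a plain string for a single id, and it mirrors the "quick return" mentioned in the commit list only in spirit.

```python
from typing import List, Union

# Hypothetical toy vocabulary, not taken from any real model.
VOCAB = {0: "[PAD]", 1: "hello", 2: "world"}


def convert_ids_to_tokens(ids: Union[int, List[int]]) -> Union[str, List[str]]:
    """Toy lookup: a single int yields a single string, a list yields a list."""
    if isinstance(ids, int):
        return VOCAB[ids]
    return [VOCAB[i] for i in ids]


def decode_buggy(ids: Union[int, List[int]]) -> str:
    tokens = convert_ids_to_tokens(ids)
    # When `tokens` is a str, join() iterates over its characters,
    # so a single id is decoded with a space between every character.
    return " ".join(tokens)


def decode_fixed(ids: Union[int, List[int]]) -> str:
    tokens = convert_ids_to_tokens(ids)
    # Quick return: a single token needs no joining at all.
    if isinstance(tokens, str):
        return tokens
    return " ".join(tokens)


print(repr(decode_buggy(0)))       # '[ P A D ]'
print(repr(decode_fixed(0)))       # '[PAD]'
print(repr(decode_fixed([1, 2])))  # 'hello world'
```

List inputs behave the same in both versions; only the single-id path changes, which is why a narrow early return is enough in this sketch.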
Who can review?
@ArthurZucker